We introduce new online and batch algorithms
that are robust to data with missing
features, a situation that arises in many prac- 
tical applications. In the online setup, we al- 
low for the comparison hypothesis to change 
as a function of the subset of features that is 
observed on any given round, extending the 
standard setting where the comparison hy- 
pothesis is fixed throughout. In the batch 
setup, we present a convex relaxation of a 
non-convex problem to jointly estimate an 
imputation function, used to fill in the values 
of missing features, along with the classifica- 
tion hypothesis. We prove regret bounds in 
the online setting and Rademacher complex- 
ity bounds for the batch i.i.d. setting. The al- 
gorithms are tested on several UCI datasets, 
showing superior performance over baseline 
imputation methods. 



1 Introduction 

Standard learning algorithms assume that each train- 
ing example is fully observed and doesn't suffer any 
corruption. However, in many real-life scenarios, train- 
ing and test data often undergo some form of cor- 
ruption. We consider settings where all the features 
might not be observed in every example, allowing for 
both adversarial and stochastic feature deletion mod- 
els. Such situations arise, for example, in medical 
diagnosis — predictions are often desired using only a 
partial array of medical measurements due to time or 
cost constraints. Survey data are often incomplete due 
to partial non-response of participants. Vision tasks 
routinely need to deal with partially corrupted or oc- 
cluded images. Data collected through multiple sen- 
sors, such as multiple cameras, is often subject to the 
sudden failure of a subset of the sensors. 



In this work, we design and analyze learning algo- 
rithms that address these examples of learning with 
missing features. The first setting we consider is on- 
line learning where both examples and missing features 
are chosen in an arbitrary, possibly adversarial, fash- 
ion. We define a novel notion of regret suitable to the 
setting and provide an algorithm which has a provably
bounded regret on the order of $O(\sqrt{T})$, where
T is the number of examples. The second scenario is 
batch learning, where examples and missing features 
are drawn according to a fixed and unknown distribu- 
tion. We design a learning algorithm which is guaran- 
teed to globally optimize an intuitive objective func- 
tion and which also exhibits a generalization error on 
the order of $O(\sqrt{d/T})$, where d is the data dimension.

Both algorithms are also explored empirically across 
several publicly available datasets subject to various 
artificial and natural types of feature corruption. We 
find very encouraging results, indicating the efficacy 
of the suggested algorithms and their superior perfor- 
mance over baseline methods. 

Learning with missing or corrupted features has a long 
history in statistics [T31 [TO], and has received recent
attention in machine learning [OJ 1151 [5J [7J. Imputa- 
tion methods (see [TU [T5l [10]) fill in missing values, 
generally independent of any learning algorithm, af- 
ter which standard algorithms can be applied to the 
data. Better performance might be expected, though, 
by learning the imputation and prediction functions 
simultaneously. Previous works [15] address this issue 
using EM, but can get stuck in local optima and do 
not have strong theoretical guarantees. Our work also
differs from settings where features are missing
only at test time [5J [TTJ , settings that give access to 
noisy versions of all the features [fr or settings where 
observed features are picked by the algorithm [5]- 

Section 2 introduces both the general online and batch
settings. Sections 3 and 4 detail the algorithms and
theoretical results within the online and batch settings,
respectively. Empirical results are presented in Section 5.



2 The Setting 

In our setting it will be useful to denote a training instance $x_t \in \mathbb{R}^d$ and prediction $y_t$, as well as a corruption vector $z_t \in \{0,1\}^d$, where

$$[z_t]_i = \begin{cases} 0 & \text{if feature } i \text{ is not observed,} \\ 1 & \text{if feature } i \text{ is observed.} \end{cases}$$

We will discuss as specific examples both classification problems where $y_t \in \{-1, 1\}$ and regression problems where $y_t \in \mathbb{R}$. The learning algorithm is given the corruption vector $z_t$ as well as the corrupted instance,

$$x'_t = x_t \circ z_t ,$$

where $\circ$ denotes the component-wise product between two vectors. Note that the training algorithm is never given access to $x_t$; however, it is given $z_t$, and so knows exactly which coordinates have been corrupted. The following subsections explain the online and batch settings respectively, as well as the type of hypotheses that are considered in each.
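
As a small illustration of the corruption model, the following sketch builds a corrupted instance from a full instance and a corruption vector. The function name `corrupt` and the example values are assumptions made for illustration only.

```python
import numpy as np

def corrupt(x, z):
    """Return the corrupted instance x' = x o z (component-wise product)."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    return x * z

x_t = np.array([0.5, -1.2, 3.0, 0.7])   # full instance, never shown to the learner
z_t = np.array([1, 0, 1, 0])            # 1 = observed, 0 = not observed
x_prime_t = corrupt(x_t, z_t)           # the learner receives (x_prime_t, z_t)
```

Unobserved coordinates are zeroed out, so without $z_t$ a missing feature would be indistinguishable from a feature whose true value is zero.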

2.1 Online learning with missing features 

In this setting, at each time-step t the learning algorithm is presented with an arbitrarily (possibly adversarially) chosen instance $(x'_t, z_t)$ and is expected to predict $y_t$. After prediction, the label is revealed to the learner, which can then update its hypothesis.

A natural question to ask is what happens if we simply ignore the distinction between $x'_t$ and $x_t$ and just run an online learning algorithm on this corrupted data. Indeed, doing so would give a small bound on regret:

$$R(T, \ell) = \sum_{t=1}^{T} \ell(\langle w_t, x'_t\rangle, y_t) - \inf_{w \in W} \sum_{t=1}^{T} \ell(\langle w, x'_t\rangle, y_t), \qquad (1)$$

with respect to a convex loss function $\ell$ and for any convex compact subset $W \subset \mathbb{R}^d$. However, any fixed weight vector w in the second term might have a very large loss, making the regret guarantee useless: both the learner and the comparator have a large loss, making the difference small. For instance, assume one feature perfectly predicts the label, while another one only predicts the label with 80% accuracy, and $\ell$ is the quadratic loss. It is easy to see that there is no fixed w that will perform well on both examples where the first feature is observed and examples where the first feature is missing but the second one is observed.
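
The two-feature example can be checked numerically. The sketch below is only an illustration under assumed settings (synthetic labels, the first feature missing on roughly half of the rounds, least-squares fits in hindsight); it compares the best fixed weight vector with the best weight vector chosen separately for each corruption pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
y = rng.choice([-1.0, 1.0], size=n)
# Feature 1 equals the label; feature 2 agrees with the label with probability 0.8.
x = np.column_stack([y, np.where(rng.random(n) < 0.8, y, -y)])
# Assumed corruption: about half the rounds observe both features, the rest only feature 2.
z = np.where(rng.random(n)[:, None] < 0.5, [1, 1], [0, 1])
x_prime = x * z

def fit_and_sse(X, targets):
    """Least-squares fit in hindsight; returns the sum of squared errors."""
    w, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return float(np.sum((X @ w - targets) ** 2))

# Best single fixed w over all rounds.
fixed_loss = fit_and_sse(x_prime, y) / n

# Best w chosen separately for each corruption pattern.
both = z[:, 0] == 1
dependent_loss = (fit_and_sse(x_prime[both], y[both])
                  + fit_and_sse(x_prime[~both], y[~both])) / n
```

In this setup the pattern-dependent comparator has a strictly smaller average quadratic loss than any fixed w, which is the gap the corruption-dependent notion of regret is meant to capture.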

To address the above concerns, we consider using a linear corruption-dependent hypothesis which is permitted to change as a function of the observed corruption $z_t$. Specifically, given the corrupted instance and corruption vector, the predictor uses a function $w_t(\cdot) : \{0,1\}^d \to \mathbb{R}^d$ to choose a weight vector, and makes the prediction $\hat{y}_t = \langle w_t(z_t), x'_t\rangle$. In order to provide theoretical guarantees, we will bound the following notion of regret,

$$R^z(T, \ell) = \sum_{t=1}^{T} \ell(\langle w_t, x'_t\rangle, y_t) - \inf_{w \in W} \sum_{t=1}^{T} \ell(\langle w(z_t), x'_t\rangle, y_t), \qquad (2)$$

where it is implicit that $w_t$ also depends on $z_t$ and W now consists of corruption-dependent hypotheses. Similar definitions of regret have been looked at in the setting of learning with side information [5] [T^ , but our special case admits stronger results in terms of both upper and lower bounds. In the most general case, we may consider W as the class of all functions which map $\{0,1\}^d \to \mathbb{R}^d$; however, we show this can lead to an intractable learning problem. This motivates the study of interesting subsets of this most general function class, which is the main focus of Section 3.

2.2 Batch learning with missing features 

In the setup of batch learning with i.i.d. data, examples $(x_t, z_t, y_t)$ are drawn according to a fixed but unknown distribution and the goal is to choose a hypothesis that minimizes the expected error, with respect to an appropriate loss function $\ell$: $\mathbb{E}_{x_t, z_t, y_t}[\ell(h(x_t, z_t), y_t)]$.

The hypotheses h we consider in this scenario will be inspired by imputation-based methods prevalent in the statistics literature used to address the problem of missing features [14]. An imputation mapping is a function used to fill in unobserved features using the observed features, after which the completed examples can be used for prediction. In particular, if we consider an imputation function $\phi : \mathbb{R}^d \times \{0,1\}^d \to \mathbb{R}^d$, which is meant to fill in missing feature values, and a linear predictor $w \in \mathbb{R}^d$, we can parameterize a hypothesis with these two functions: $h_{\phi,w}(x'_t, z_t) = \langle w, \phi(x'_t, z_t)\rangle$.

It is clear that the multiplicative interaction between w and $\phi$ will make most natural formulations non-convex, and we elaborate more on this in Section 4. In the i.i.d. setting, the natural quantity of interest is the generalization error of our learned hypothesis. We provide a Rademacher complexity bound on the class of $(w, \phi)$ pairs we use, thereby showing that any hypothesis with a small empirical error will also have a small expected loss. The specific class of hypotheses and details of the bound are presented in Section 4. Furthermore, the reason why an imputation-based hypothesis class is not analyzed in the more general adversarial setting will also be explained in that section.



3 Online Corruption-Based Algorithm 

In this section, we consider the class of corruption-dependent hypotheses defined in Section 2.1. Recall the definition of regret (2), which we wish to control in this framework, and of the comparator class of functions $W \subseteq \{0,1\}^d \to \mathbb{R}^d$. It is clear that the function class W is much richer than the comparator class in the corruption-free scenario, where the best linear predictor is fixed for all rounds. It is natural to ask if it is even possible to prove a non-trivial regret bound over this richer comparator class W. In fact, the first result of our paper provides a lower bound on the minimax regret when the comparator is allowed to pick arbitrary mappings, i.e. the set W contains all mappings. The result is stated in terms of the minimax regret under the loss function $\ell$ under the usual (corruption-free) definition (1):

$$R^*(T, \ell) = \inf_{w_1 \in W} \sup_{(x_1, z_1, y_1)} \cdots \inf_{w_T \in W} \sup_{(x_T, z_T, y_T)} R(T, \ell)$$

Proposition 1 If $W = \{0,1\}^d \to W_1$, the minimax value of the corruption dependent regret for any loss function $\ell$ is lower bounded as

$$\inf_{w_1 \in W} \sup_{(x_1, z_1, y_1)} \cdots \inf_{w_T \in W} \sup_{(x_T, z_T, y_T)} R^z(T, \ell)$$

This proposition (the proof of which appears in the appendix [17]) shows that the minimax regret is lower bounded by a term that is exponential in the dimensionality of the learning problem. For most non-degenerate convex and Lipschitz losses, $R^*(T, \ell) = \Omega(\sqrt{T})$ without further assumptions (see e.g. pQ), which yields an $\Omega(2^{d/4}\sqrt{T})$ lower bound. The bound can be further strengthened to $\Omega(2^{d/2}\sqrt{T})$ for linear losses, which is unimprovable since it is achieved by solving the classification problem corresponding to each pattern independently.

Thus, it will be difficult to achieve a low regret against arbitrary maps from $\{0,1\}^d$ to $\mathbb{R}^d$. In the following section we consider a restricted function class and show that a mirror-descent algorithm can achieve regret polynomial in d and sublinear in T, implying that the average regret is vanishing.

3.1 Linear Corruption-Dependent Hypotheses 

Here we analyze a corruption-dependent hypothesis class that is parametrized by a matrix $A \in \mathbb{R}^{d \times k}$, where k may be a function of d. In the simplest case of k = d, the parametrization looks for weights $w(z_t)$ that depend linearly on the corruption vector $z_t$. Defining $w_A(z_t) = A z_t$ achieves this, and intuitively this allows us to capture how the presence or absence of one feature affects the weight of another feature. This will be clarified further in the examples.

In general, the matrix A will be d × k, where k will be determined by a function $\psi(z_t) \in \{0,1\}^k$ that maps $z_t$ to a possibly higher dimensional space. Given a fixed $\psi$, the explicit parameterization in terms of A is

$$w_{A,\psi}(z_t) = A\,\psi(z_t). \qquad (3)$$

In what follows, we drop the subscript from $w_{A,\psi}$ in order to simplify notation. Essentially this allows us to introduce non-linearities as a function of the corruption vector, but the non-linear transform is known and fixed throughout the learning process. Before analyzing this setting, we give a few examples and intuition as to why such a parametrization is useful. In each example, we will show how there exists a choice of a matrix A that captures the specific problem's assumptions. This implies that the fixed comparator can use this choice in hindsight, and by having a low regret, our algorithm would implicitly learn a hypothesis close to this reasonable choice of A.
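
A minimal sketch of how a prediction is formed under parametrization (3); the names `predict` and `identity_psi` and the example values are illustrative assumptions, not part of the original presentation.

```python
import numpy as np

def predict(A, psi, x_prime, z):
    """Prediction <A psi(z), x'> using corruption-dependent weights w(z) = A psi(z)."""
    w = A @ psi(z)
    return float(w @ np.asarray(x_prime, dtype=float))

def identity_psi(z):
    """The k = d case, psi(z) = z."""
    return np.asarray(z, dtype=float)

A = np.eye(3)                          # with A = I the weights are w(z) = z
x_prime = np.array([1.0, 0.0, 3.0])    # second feature is missing
z = np.array([1, 0, 1])
y_hat = predict(A, identity_psi, x_prime, z)
```

Only A is learned; `psi` is a fixed, known transform of the corruption vector.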

3.1.1 Corruption-free special case

We start by noting that in the case of no corruption (i.e. $\forall t,\ z_t = \mathbf{1}$) a standard linear hypothesis model can be cast within the matrix based framework by defining $\psi(z_t) = 1$ and learning $A \in \mathbb{R}^{d \times 1}$.

3.1.2 Ranking-based parameterization 

One natural method for classification is to order the 
features by their predictive power, and to weight fea- 
tures proportionally to their ranking (in terms of ab- 
solute value; that is, the sign of weight depends on 
whether the correlation with the label is positive or 
negative). In the corrupted features setting, this nat- 
urally corresponds to taking the available features at 
any round and putting more weight on the most pre- 
dictive observed features. This is particularly impor- 
tant while using margin-based losses such as the hinge 
loss, where we want the prediction to have the right 
sign and be large enough in magnitude. 

Our parametrization allows such a strategy when using a simple function $\psi(z_t) = z_t$. Without loss of generality, assume that the features are arranged in decreasing order of discriminative power (we can always rearrange rows and columns of A if they're not). We also assume positive correlations of all features with the label; a more elaborate construction works for A when they're not. In this case, consider the parameter matrix and the induced classification weights

$$[A]_{i,j} = \begin{cases} 1, & j = i \\ -\frac{1}{d}, & j < i \\ 0, & j > i \end{cases}, \qquad [w(z_t)]_i = [z_t]_i - \frac{1}{d}\sum_{j<i} [z_t]_j .$$

Thus, for all i < j such that $[z_t]_i = [z_t]_j = 1$ we have $[w(z_t)]_i > [w(z_t)]_j$. The choice of 1 for diagonals and 1/d for off-diagonals is arbitrary and other values might also be picked based on the data sequence $(x_t, z_t, y_t)$. In general, features are weighted monotonically with respect to their discriminative power with signs based on correlations with the label.
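
The ranking construction can be written down directly. The sketch below assumes the lower-triangular form with 1 on the diagonal and -1/d below it, and checks that observed features receive strictly decreasing weights; the helper name `ranking_matrix` is an assumption for illustration.

```python
import numpy as np

def ranking_matrix(d):
    """[A]_{i,j} = 1 if j == i, -1/d if j < i, and 0 if j > i."""
    return np.eye(d) - np.tril(np.ones((d, d)), k=-1) / d

A = ranking_matrix(5)
z = np.array([1, 0, 1, 1, 0], dtype=float)   # features 0, 2 and 3 are observed
w = A @ z                                    # induced weights w(z) = A z
observed_weights = w[z == 1]
```

Weights on unobserved coordinates do not affect the prediction, since the corresponding entries of the corrupted instance are zero.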

3.1.3 Feature group based parameterization 

Another class of hypotheses that we can define within this framework are those restricted to consider up to p-wise interactions between features for some constant $0 < p < d$. In this case, we index the $k = \sum_{i=1}^{p}\binom{d}{i} = O((d/p)^p)$ unique subsets of features of size up to p. Then define $[\psi(z_t)]_j = 1$ if the corresponding subset j is uncorrupted by $z_t$ and equal to 0 otherwise. An entry $[A]_{i,j}$ now specifies the importance of feature i, assuming that at least the subset j is present. Such a model would, for example, have the ability to capture the scenario of a feature that is only discriminative in the presence of some p − 1 other features. For example, we can generalize the ranking example from above to impose a soft ranking on groups of features.
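
One way to realize the subset indicator map is sketched below, assuming subsets are enumerated by increasing size; the helper names `subset_index` and `psi` are illustrative.

```python
from itertools import combinations
import numpy as np

def subset_index(d, p):
    """All non-empty feature subsets of size at most p, in a fixed order."""
    return [s for size in range(1, p + 1) for s in combinations(range(d), size)]

def psi(z, subsets):
    """[psi(z)]_j = 1 if every feature in subset j is observed, and 0 otherwise."""
    z = np.asarray(z)
    return np.array([float(all(z[i] == 1 for i in s)) for s in subsets])

subsets = subset_index(d=4, p=2)          # 4 singletons + 6 pairs
features = psi([1, 0, 1, 1], subsets)     # feature 1 is missing
```

The number of columns of A grows with the number of subsets, so p has to stay small for the parametrization to remain tractable.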

3.1.4 Corruption due to failed sensors 

A common scenario for missing features arises in ap- 
plications involving an array of measurements, for ex- 
ample, from a sensor network, wireless motes, array 
of cameras or CCDs, where each sensor is bound to 
fail occasionally. The typical strategy for dealing with 
such situations involves the use of redundancy. For 
instance, if a sensor fails, then some kind of an aver- 
aged measurement from the neighboring sensors might 
provide a reasonable surrogate for the missing value. 

It is possible to design a choice of A matrix for the comparator that only uses the local measurement when it is present, but uses an averaged approximation based on some fixed averaging distribution on neighboring features when the local measurement is missing. For each feature i, we consider a probability distribution $p_i$ which specifies the averaging weights to be used when approximating feature i using neighboring observations, writing $p_{i,j}$ for the weight it places on feature j. Let $w^*$ be the weight vector that the comparator would like to use if all the features were present. Then, with $\psi(z) = z$ and for $j \neq i$ we define,

$$[A]_{i,i} = [w^*]_i + \sum_{j \neq i} [w^*]_j\, p_{j,i}, \qquad [A]_{i,j} = -[w^*]_j\, p_{j,i}. \qquad (4)$$

Thus, say only feature k is missing, we still have

$$x'^{\top}_t A z_t = \sum_{i,j} [x'_t]_i [z_t]_j [A]_{i,j} = \sum_{i \neq k,\, j \neq k} [x_t]_i [A]_{i,j} = \sum_{i \neq k} [x_t]_i [w^*]_i + [w^*]_k \sum_{i \neq k} [x_t]_i\, p_{k,i},$$

where by assumption $\sum_{i \neq k} [x_t]_i\, p_{k,i} \approx [x_t]_k$.

Of course, the averaging in such applications is typ- 
ically local, and we expect each sensor to put large 
weights only on neighboring sensors. This can be spec- 
ified via a neighborhood graph, where nodes i and j 
have an edge if j is used to predict i when feature i 
is not observed and vice versa. From the construction
(4) it is clear that the only off-diagonal entries
that are non-zero would correspond to the edges in 
the neighborhood graph. Thus we can even add this 
information to our algorithm and constrain several off- 
diagonal elements to be zero, thereby restricting the 
complexity of the problem. 
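
The failed-sensor construction can be checked numerically. The sketch below assumes the matrix has diagonal entries $[w^*]_i + \sum_{j \neq i} [w^*]_j p_{j,i}$ and off-diagonal entries $-[w^*]_j p_{j,i}$, which is the form implied by the single-missing-feature identity above; the uniform averaging weights and the example values are illustrative assumptions.

```python
import numpy as np

def sensor_matrix(w_star, P):
    """Build A from w* and averaging weights P, where P[j, i] = p_{j,i} and P[j, j] = 0."""
    B = P * w_star[:, None]                           # B[j, i] = w*_j p_{j,i}
    A = -B.T                                          # off-diagonal: [A]_{i,j} = -w*_j p_{j,i}
    A[np.diag_indices(len(w_star))] = w_star + B.sum(axis=0)
    return A

w_star = np.array([1.0, 2.0, -1.0, 0.5])
d = len(w_star)
P = (np.ones((d, d)) - np.eye(d)) / (d - 1)           # assumed: uniform average over the others
A = sensor_matrix(w_star, P)

x = np.array([0.3, -1.0, 2.0, 0.7])
z = np.array([1.0, 1.0, 0.0, 1.0])                    # only feature k = 2 is missing
prediction = (x * z) @ A @ z
keep = z == 1
expected = x[keep] @ w_star[keep] + w_star[2] * (x[keep] @ P[2, keep])
```

With every feature observed the construction reduces to the plain linear predictor, since A applied to the all-ones vector returns $w^*$.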

3.2 Matrix-Based Algorithm and Regret 

We use a standard mirror-descent style algorithm [TBI [3] in the matrix based parametrization described above. It is characterized by a strongly convex regularizer $\mathcal{R} : \mathbb{R}^{d \times k} \to \mathbb{R}$, that is

$$\mathcal{R}(A) \ge \mathcal{R}(B) + \langle \nabla\mathcal{R}(B), A - B\rangle_F + \frac{1}{2}\|A - B\|^2 \quad \forall A, B \in \mathcal{A},$$

for some norm $\|\cdot\|$ and where $\langle A, B\rangle_F = \mathrm{Tr}(A^\top B)$ is the trace inner product. An example is the squared Frobenius norm $\mathcal{R}(A) = \frac{1}{2}\|A\|_F^2$. For any such function, we can define the associated Bregman divergence

$$D_{\mathcal{R}}(A, B) = \mathcal{R}(A) - \mathcal{R}(B) - \langle \nabla\mathcal{R}(B), A - B\rangle_F .$$

We assume $\mathcal{A}$ is a convex subset of $\mathbb{R}^{d \times k}$, which could encode constraints such as some off-diagonal entries being zero in the setup of Section 3.1.4. To simplify presentation in what follows, we will use the shorthand $\ell_t(A) = \ell(\langle A\psi(z_t), x'_t\rangle, y_t)$. The algorithm initializes with any $A_0 \in \mathcal{A}$ and updates

$$A_{t+1} = \arg\min_{A \in \mathcal{A}} \big\{ \eta_t \langle \nabla\ell_t(A_t), A\rangle_F + D_{\mathcal{R}}(A, A_t) \big\}. \qquad (5)$$

If $\mathcal{A} = \mathbb{R}^{d \times k}$ and $\mathcal{R}(A) = \frac{1}{2}\|A\|_F^2$, the update simplifies to gradient descent $A_{t+1} = A_t - \eta_t \nabla\ell_t(A_t)$.

Our main result of this section is a guarantee on the regret incurred by Algorithm (5). The proof follows from standard arguments (see e.g. [IH|4]). Below, the dual norm is defined as $\|V\|_* = \sup_{U : \|U\| \le 1} \langle U, V\rangle_F$.

Theorem 1 Let $\mathcal{R}$ be strongly convex with respect to a norm $\|\cdot\|$ and $\|\nabla\ell_t(A)\|_* \le G$, then Algorithm (5) with learning rate $\eta_t = \frac{R}{G\sqrt{t}}$ exhibits the following regret upper bound compared to any A with $\|A\| \le R$,

$$\sum_{t=1}^{T} \ell(\langle A_t z_t, x'_t\rangle, y_t) - \inf_{A} \sum_{t=1}^{T} \ell(\langle A z_t, x'_t\rangle, y_t) \le 3RG\sqrt{T}.$$
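
A minimal sketch of the unconstrained gradient-descent special case of the update, under assumptions that are not in the original: the squared loss, no projection onto a constraint set, and a step size of R / (G sqrt(t)). The function name `online_matrix_gd` is hypothetical.

```python
import numpy as np

def online_matrix_gd(stream, d, k, psi, R=1.0, G=1.0):
    """Run A_{t+1} = A_t - eta_t * grad l_t(A_t) with the squared loss (an assumed choice).

    stream yields (x_prime, z, y); psi maps z to a vector in {0,1}^k.
    """
    A = np.zeros((d, k))
    losses = []
    for t, (x_prime, z, y) in enumerate(stream, start=1):
        feat = psi(z)
        pred = x_prime @ A @ feat                           # <A psi(z), x'>
        losses.append(float((pred - y) ** 2))
        grad = 2.0 * (pred - y) * np.outer(x_prime, feat)   # gradient w.r.t. A
        A = A - (R / (G * np.sqrt(t))) * grad               # assumed step size R / (G sqrt(t))
    return A, losses

# One round with d = k = 2 and psi(z) = z.
example = [(np.array([1.0, 0.0]), np.array([1.0, 0.0]), 1.0)]
A_1, round_losses = online_matrix_gd(example, d=2, k=2, psi=lambda z: z, R=0.5, G=1.0)
```

A constrained variant (for example, forcing some entries of A to zero) would additionally project or mask A after each step.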



4 Batch Imputation Based Algorithm 

Recalling the setup of Section 2.2, in this section we look at imputation mappings of the form

$$\phi_M(x', z) = x' + \mathrm{diag}(\mathbf{1} - z)\, M^\top x'. \qquad (6)$$

Thus we retain all the observed entries in the vector x', while the missing features are predicted using a linear combination of the observed features, where the ith column of M encodes the averaging weights for the ith feature. Such a linear prediction framework for features is natural. For instance, when the data vectors x are Gaussian, the conditional expectation of any feature given the other features is a linear function. The predictions are now made using the dot product

$$\langle w, \phi_M(x', z)\rangle = \langle w, x'\rangle + \langle w, \mathrm{diag}(\mathbf{1} - z)\, M^\top x'\rangle,$$

where we would like to estimate w, M based on the data samples. From a quick inspection of the resulting learning problem, it becomes clear that optimizing over such a hypothesis class leads to a non-convex problem. The convexity of the loss plays a critical role in the regret framework of online learning, which is why we restrict ourselves to a batch i.i.d. setting here.
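
The imputation map can be sketched in a few lines; the function name `impute` and the example matrix M (which fills in the middle feature with the average of its two neighbors) are assumptions chosen for illustration.

```python
import numpy as np

def impute(x_prime, z, M):
    """phi_M(x', z) = x' + diag(1 - z) M^T x'."""
    x_prime = np.asarray(x_prime, dtype=float)
    z = np.asarray(z, dtype=float)
    return x_prime + (1.0 - z) * (M.T @ x_prime)

M = np.zeros((3, 3))
M[:, 1] = [0.5, 0.0, 0.5]            # column 1: weights used to fill in feature 1
x_filled = impute([2.0, 0.0, 4.0], [1, 0, 1], M)

w = np.array([1.0, 1.0, 1.0])
prediction = float(w @ x_filled)     # <w, phi_M(x', z)>
```

Observed coordinates pass through unchanged because the factor (1 - z) vanishes on them. The prediction is bilinear in (w, M), which is the source of the non-convexity.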

In the sequel we will provide a convex relaxation to
the learning problem resulting from the parametrization
(6). While we can make this relaxation for natu-
ral loss functions in both classification and regression 
scenarios, we restrict ourselves to a linear regression 
setting here as the presentation for that example is 
simpler due to the existence of a closed form solution 
for the ridge regression problem. 

In what follows, we consider only the corrupted data and thus simply denote corrupted examples as $x_i$. Let X denote the matrix with ith row equal to $x_i$ and similarly define Z as the matrix with ith row equal to $z_i$. It will also be useful to define $\bar{Z} = \mathbf{1}\mathbf{1}^\top - Z$ and $\bar{z}_i = \mathbf{1} - z_i$, and finally let $\bar{Z}_i = \mathrm{diag}(\bar{z}_i)$.

4.1 Imputed Ridge Regression (IRR) 

In this section we will consider a modified version of the ridge regression (RR) algorithm, robust to missing features. The overall optimization problem we are interested in is as follows,

$$\min_{\{w, M : \|M\|_F \le \gamma\}} \ \frac{\lambda}{2}\|w\|^2 + \frac{1}{T}\sum_{i=1}^{T} \big(y_i - w^\top (x_i + \bar{Z}_i M^\top x_i)\big)^2 \qquad (7)$$

where the hypothesis w and imputation matrix M are simultaneously optimized. In order to bound the size of the hypothesis set, we have introduced the constraint $\|M\|_F^2 \le \gamma^2$ that bounds the Frobenius norm of the imputation matrix. The global optimum of the problem as presented in (7) cannot be easily found as it is not jointly convex in both w and M. We next present a convex relaxation of the formulation (7). The key idea is to take a dual over w but not M, so that we have a saddle-point problem in the dual vector $\alpha$ and M. The resulting saddle-point problem, while being concave in $\alpha$, is still not convex in M. At this step we introduce a new tensor $N \in \mathbb{R}^{d \times d \times d}$, where $[N]_{i,j,k} = [M]_{i,k}[M]_{j,k}$. Finally we drop the non-convex constraint relating M and N, replacing it with a matrix positive semidefiniteness constraint.

Before we can describe the convex relaxation, we need one more piece of notation. Given a matrix M and a tensor N, we define the matrix $K_{MN} \in \mathbb{R}^{T \times T}$ by

$$[K_{MN}]_{i,j} = x_i^\top x_j + x_i^\top M \bar{Z}_i x_j + x_i^\top \bar{Z}_j M^\top x_j + \sum_{k=1}^{d} [\bar{z}_i]_k [\bar{z}_j]_k\, x_i^\top N_k x_j , \qquad (8)$$

where $N_k$ denotes the kth slice $[N]_{\cdot,\cdot,k}$.
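
The following sketch computes the matrix in (8) and checks it against the Gram matrix of the imputed examples in the unrelaxed case, where the kth slice of N is the outer product of the kth column of M with itself. The function name `kernel_mn` and the random test data are assumptions for illustration.

```python
import numpy as np

def kernel_mn(X, Z, M, N):
    """K_MN of (8). X, Z: (T, d) corrupted data and masks; M: (d, d); N[:, :, k] is slice N_k."""
    Zbar = 1.0 - Z
    cross = ((X @ M) * Zbar) @ X.T                 # [i, j] = x_i^T M Zbar_i x_j
    quad = np.einsum('ia,abk,jb->ijk', X, N, X)    # [i, j, k] = x_i^T N_k x_j
    return (X @ X.T + cross + cross.T
            + np.einsum('ik,jk,ijk->ij', Zbar, Zbar, quad))

rng = np.random.default_rng(0)
T, d = 6, 4
Z = (rng.random((T, d)) < 0.7).astype(float)
X = rng.normal(size=(T, d)) * Z                    # corrupted examples
M = 0.3 * rng.normal(size=(d, d))
N = np.einsum('ak,bk->abk', M, M)                  # unrelaxed case: N_k = m_k m_k^T

K = kernel_mn(X, Z, M, N)
Phi = X + (1.0 - Z) * (X @ M)                      # rows are phi_M(x_i, z_i)
```

In the relaxation N is a free variable, so the resulting matrix need not be a Gram matrix, which is why positive semidefiniteness is imposed explicitly.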



The following proposition gives the convex relaxation of the problem (7) that we refer to as Imputed Ridge Regression (IRR) and which includes a strictly larger hypothesis class than the (w, M) pairs with which we began.

Proposition 2 The following semi-definite programming optimization problem provides a convex relaxation to the non-convex problem (7):

$$\min_{t,\ M : \|M\|_F \le \gamma,\ N : \sum_k \|N_k\|_F^2 \le \gamma^4} \ t \qquad (9)$$

$$\text{s.t.} \quad \begin{bmatrix} K_{MN} + \lambda T I & y \\ y^\top & t \end{bmatrix} \succeq 0, \qquad K_{MN} \succeq 0.$$



The proof is deferred to the appendix for lack of space. The main idea is to take the quadratic form that arises in the dual formulation of (7) with the matrix $K_M$, and relax it to the matrix $K_{MN}$ of (8). The constraint involving positive semidefiniteness of $K_{MN}$ is needed to ensure the convexity of the relaxed problem. The norm constraint on N is a consequence of the norm constraint on M.

One tricky issue with relaxations is using the relaxed solution in order to find a good solution to the original problem. In our case, this would correspond to finding a good w, M pair for the primal problem (7). We bypass this step, and instead directly define the prediction on any point $(x_0, z_0)$ as:

$$h(x_0, z_0) = \sum_{i=1}^{T} \alpha_i \Big( x_i^\top x_0 + x_i^\top M \bar{Z}_i x_0 + x_i^\top \bar{Z}_0 M^\top x_0 + \sum_{k=1}^{d} [\bar{z}_i]_k [\bar{z}_0]_k\, x_i^\top N_k x_0 \Big). \qquad (10)$$



Here, $\alpha$, M, N are solutions to the saddle-point problem

$$\min_{M : \|M\|_F \le \gamma,\ N : \sum_k \|N_k\|_F^2 \le \gamma^4} \ \max_{\alpha} \ 2\alpha^\top y - \alpha^\top (K_{MN} + \lambda T I)\alpha. \qquad (11)$$

We start by noting that the above optimization problem is equivalent to the one in Proposition 2. The intuition behind the definition (10) is that the solution to the problem (7) has this form, with $[N]_{i,j,k}$ replaced with $[M]_{i,k}[M]_{j,k}$. In the next section, we show a Rademacher complexity bound over functions of the form above to justify our convex relaxation.
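
For a fixed pair (M, N), the inner maximization of the saddle-point problem can be carried out in closed form, which is one way to see the equivalence with the semidefinite program. A short derivation, assuming $\lambda > 0$ and $K_{MN} \succeq 0$ so that $K_{MN} + \lambda T I$ is invertible:

```latex
\begin{align*}
\nabla_\alpha \big( 2\alpha^\top y - \alpha^\top (K_{MN} + \lambda T I)\alpha \big) = 0
  &\iff \alpha = (K_{MN} + \lambda T I)^{-1} y, \\
\max_\alpha \; 2\alpha^\top y - \alpha^\top (K_{MN} + \lambda T I)\alpha
  &= y^\top (K_{MN} + \lambda T I)^{-1} y, \\
t \ge y^\top (K_{MN} + \lambda T I)^{-1} y
  &\iff \begin{bmatrix} K_{MN} + \lambda T I & y \\ y^\top & t \end{bmatrix} \succeq 0
  \quad \text{(Schur complement)}.
\end{align*}
```

Minimizing t subject to the last constraint therefore returns the same value as minimizing the inner maximum over M and N.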

4.2 Theoretical analysis of IRR 

As mentioned in the previous section, we predict with a hypothesis of the form (10) rather than going back to the primal class indexed by (w, M) pairs. In this section, we would like to show that the new hypothesis class parametrized by $\alpha$, M, N is not too rich for the purposes of learning. To do this, we give the class of all possible hypotheses that can be the solutions to the dual problem (9) and then prove a Rademacher complexity bound over that class. The set of all possible $\alpha$, M, N triples that can be potential solutions to (9) lie in the following set

$$\mathcal{H} = \Big\{ h : (x_0, z_0) \mapsto \sum_{i=1}^{T} \alpha_i \Big( x_i^\top x_0 + x_i^\top M \bar{Z}_i x_0 + x_i^\top \bar{Z}_0 M^\top x_0 + \sum_{k=1}^{d} [\bar{z}_i]_k [\bar{z}_0]_k\, x_i^\top N_k x_0 \Big) : \|M\|_F \le \gamma,\ \|N\|_F \le \gamma^2,\ \|\alpha\| \le \frac{B}{\lambda\sqrt{T}} \Big\}.$$

The bound on $\|\alpha\|$ is made implicitly in the optimization problem (assuming the training labels are bounded, $\forall i,\ |y_i| \le B$). To see this, we note that the problem (9) is obtained from (11) by using the closed-form solution of the optimal $\alpha = (K_{MN} + \lambda T I)^{-1} y$. Then we can bound $\|\alpha\| \le \|y\| / \lambda_{\min}(K_{MN} + \lambda T I) \le \frac{B}{\lambda\sqrt{T}}$, where $\lambda_{\min}(A)$ denotes the smallest eigenvalue of the matrix A. Note that in general there is no linear hypothesis w that corresponds to the hypotheses in the relaxed class $\mathcal{H}$ and that we are dealing with a strictly more general function class. However, the following theorem demonstrates that the Rademacher complexity of this function class is reasonably bounded in terms of the number of training points T and dimension d and thereby still provides provable generalization performance [2].
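
The norm bound on the dual vector follows from two facts: labels bounded by B give a label vector of norm at most B sqrt(T), and a positive semidefinite kernel matrix gives a smallest eigenvalue of at least lambda T after regularization. The sketch below checks this numerically, using an arbitrary positive semidefinite matrix as a stand-in for $K_{MN}$; all names and values are illustrative assumptions.

```python
import numpy as np

def dual_solution(K, y, lam):
    """alpha = (K + lam * T * I)^{-1} y for a T x T positive semidefinite K."""
    T = len(y)
    return np.linalg.solve(K + lam * T * np.eye(T), y)

rng = np.random.default_rng(1)
T, B, lam = 50, 2.0, 0.1
G = rng.normal(size=(T, 5))
K = G @ G.T                          # any PSD matrix stands in for K_MN here
y = rng.uniform(-B, B, size=T)       # labels bounded by B in absolute value
alpha = dual_solution(K, y, lam)
bound = B / (lam * np.sqrt(T))       # ||y|| <= B sqrt(T), lambda_min >= lam * T
```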

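Recall the Rademacher complexity of a class $\mathcal{H}$

$$\mathfrak{R}_T(\mathcal{H}) = \mathbb{E}_S \mathbb{E}_\sigma \Big[ \frac{1}{T} \sup_{h \in \mathcal{H}} \sum_{i=1}^{T} \sigma_i h(x_i, z_i) \Big], \qquad (12)$$

where the inner expectation is over independent Rademacher random variables $(\sigma_1, \ldots, \sigma_T)$ and the outer one over a sample $S = ((x_1, z_1), \ldots, (x_T, z_T))$.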



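Theorem 2 If we assume a bounded regression problem $\forall y,\ |y| \le B$ and $\forall x,\ \|x\| \le R$, then the Rademacher complexity of the hypothesis set $\mathcal{H}$ is bounded as follows,

$$\mathfrak{R}_T(\mathcal{H}) \le \big(1 + \gamma + (\gamma + \gamma^2)\sqrt{d}\big) \frac{B R^2}{\lambda\sqrt{T}} = O\Big(\sqrt{\frac{d}{T}}\Big).$$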




Due to space constraints, the proof is presented in the
appendix. Theorem 2 allows us to control the gap
between empirical and expected risks using standard
Rademacher complexity results. Theorem 8 of [2] immediately
provides the following corollary.

Corollary 3 Under the conditions of Theorem 2, for
any $0 < \delta < 1$, with probability at least $1 - \delta$ over
samples of size T, every $h \in \mathcal{H}$ satisfies

$\mathbb{E}\big[(y - h(x', z))^2\big] \le \frac{1}{T}\sum_{t=1}^{T} \big(y_t - h(x'_t, z_t)\big)^2$



+ 



T 

BR 2 {1+ 1 ) 2 ( BR 2 {1+ 1 ) 2 
A 




5 Empirical Results 

This section presents an empirical evaluation of the online matrix-based algorithm (5) as well as the Imputed Ridge Regression algorithm of Section 4.1. We use baseline methods zero-imputation and mean-imputation, where the missing entries are replaced with zeros and with the mean estimated from observed values of those features, respectively. Once the data is imputed, a standard online gradient descent algorithm or ridge-regression algorithm is used. As reference, we also show the performance of a standard algorithm on uncorrupted data. The algorithms are evaluated on several UCI repository datasets, summarized in Table 1.

The thyroid dataset includes naturally cor- 
rupted/missing data. The optdigits dataset is 
subjected to artificial corruption by deleting a column 
of pixels, chosen uniformly at random from the 3 
central columns of the image (each image contains 
8 columns of pixels total). The remainder of the 
datasets are subjected to two types of artificial 
corruption: data-independent or data-dependent
corruption. In the first case, each feature is randomly 
deleted independently, while the features are deleted 
based on thresholding values in the latter case. 
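
The two artificial corruption types can be sketched as follows. This is only a hypothetical illustration: the exact corruption processes used in the experiments are described in the appendix and are not reproduced here, and the keep probability, the thresholds and the deletion rule below are assumptions.

```python
import numpy as np

def independent_mask(X, keep_prob, rng):
    """Data-independent corruption: each entry is kept independently with probability keep_prob."""
    return (rng.random(X.shape) < keep_prob).astype(float)

def threshold_mask(X, thresholds):
    """Data-dependent corruption (assumed rule): feature j is deleted when it exceeds thresholds[j]."""
    return (X <= thresholds).astype(float)

rng = np.random.default_rng(0)
X = np.array([[1.0, 5.0],
              [3.0, 2.0]])
Z_indep = independent_mask(X, keep_prob=0.6, rng=rng)
Z_dep = threshold_mask(X, np.array([2.0, 4.0]))
X_corrupted = X * Z_dep
```

Under data-dependent deletion, the observed values of a feature are a biased sample of its distribution, which is one reason mean-imputation can be expected to suffer in that setting.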

We report average error and standard deviations over 
5 trials, using 1000 random training examples and cor- 
ruption patterns. We tune hyper-parameters using a 
grid search from $2^{-12}$ to $2^{10}$. Further details and explicit
corruption processes appear in the appendix.



dataset      m       d     F_I          F_D
abalone      4177    7     .62 ± .08    .61 ± .12
housing      20640   8     .64 ± .08    .68 ± .20
optdigits    5620    64    .88 ± .00    .88 ± .00
park         3000    20    .58 ± .06    .61 ± .08
thyroid      3163    5     .77 ± .00    .77 ± .00
splice       1000    60    .63 ± .01    .66 ± .03
wine         6497    11    .63 ± .10    .69 ± .13

Table 1: Size of dataset (m), number of features (d), and the overall
fraction of remaining features in the training set after data-
independent (F_I) or data-dependent (F_D) corruption.

[Figure 1 legend: no corr, sparse, mean, frob, zero]



5.1 Online Corruption Dependent Hypothesis 



Here we analyze the online algorithm presented in Section 3.2 using two different types of regularization. The first method simply penalizes the Frobenius norm of the parameter matrix A (frob-reg), $\mathcal{R}(A) = \|A\|_F^2$. The second method (sparse-reg) forces a sparse solution by constraining many entries of the parameter matrix to equal zero, as mentioned in Section 3.1.4. We use the regularizer $\mathcal{R}(A) = \gamma\|A\mathbf{1}\|^2 + \|A\|_F^2$, where $\gamma$ is an additional tunable parameter. This choice of regularization is based on the example given in equation (4), where we would have $\|A\mathbf{1}\| = \|w^*\|$.

We apply these methods to the splice classification 
task and the optdigits dataset in several one vs. all 
classification tasks. For splice, the sparsity pattern 
used by the sparse-reg method is chosen by constraining
those entries $[A]_{i,j}$ where features i and j have a correlation
coefficient less than 0.2, as measured with the
corrupted training sample. In the case of optdigits, 
only entries corresponding to neighboring pixels are 
allowed to be non-zero. 

Figure 1 shows that, when subject to data-independent corruption, the zero imputation, mean imputation and frob-reg methods all perform relatively poorly while the sparse-reg method provides significant improvement for the splice dataset. Furthermore, we find data-dependent corruption is quite harmful to mean imputation as might be expected, while both frob-reg and sparse-reg still provide significant improvement over zero-imputation. More surprisingly, these methods also perform better than training on uncorrupted data. We attribute this to the fact that we are using a richer hypothesis function that is parametrized by the corruption vector while the standard algorithm uses only a fixed hypothesis. In Table 2 we see that sparse-reg performs at least as well as both zero and mean imputation in all tasks and offers significant improvement in the 3-vs-all and 6-vs-all tasks. In this case, the frob-reg method performs comparably to sparse-reg and is omitted from the table due to space.

Figure 1: 0/1 loss as a function of T for splice dataset
with independent (top left) and dependent corruption (top 
right). RMSE on abalone across varying amounts of inde- 
pendent (bottom left) and dependent corruption (bottom 
right); fraction of features remaining indicated on x-axis. 





digit   zero-imp       mean-imp       sparse-reg     no corr
2       .035 ± .002    .039 ± .004    .033 ± .003    .024 ± .002
3       .041 ± .002    .043 ± .001    .039 ± .002    .027 ± .003
4       .020 ± .002    .023 ± .002    .021 ± .001    .015 ± .001
6       .026 ± .002    .024 ± .002    .023 ± .002    .015 ± .002



Table 2: One-vs-all classification results on optdigits 
dataset (target digit in first column) with column-based 
corruption for 0/1 loss. 

5.2 Imputed Ridge Regression 

In this section we consider the performance of IRR 
across many datasets. We found standard SDP solvers 
to be quite slow for problem (9). We instead use a
semi-infinite linear program (SILP) to find an approx- 
imately optimal solution (see e.g. [13] for details). 

In Tables [3] and |4] we compare the performance of the 
IRR algorithm to zero and mean imputation as well as 
to standard ridge regression performance on the un- 
corrupted data. Here we see IRR provides improve- 
ment over zero-imputation in all cases and does at 
least as well as mean-imputation when dealing with 
data-independent corruption. For data-dependent cor- 
ruption, IRR continues to perform well, while mean- 
imputation suffers. For this setting, we have also compared
to an independent-imputation method, which imputes data
using an M matrix that is trained independently of the
learning algorithm. In particular, the ith column of M is
selected as the best linear predictor of the ith feature
given the rest, i.e. the solution to
$\arg\min_{v} \sum_{x \in X_i} \big([x]_i - \sum_{j \neq i} [v]_j [x]_j\big)^2$,
where $X_i$ is the set of training examples that have the
ith feature present. Although this method can perform
better than mean-imputation, the joint optimization
solution provided by IRR provides an even more significant
improvement. At the bottom of Table [4] we also measure
performance with thyroid, which has naturally missing
values. Here again IRR performs significantly





     zero-imp      mean-imp      IRR           no corr
A    .199 ± .004   .187 ± .003   .183 ± .002   .158 ± .002
H    .414 ± .025   .370 ± .019   .373 ± .019   .288 ± .001
P    .457 ± .006   .445 ± .004   .451 ± .004   .422 ± .004
W    .280 ± .006   .268 ± .009   .269 ± .008   .246 ± .001



Table 3: RMSE for various imputation methods across the 
datasets abalone (A), housing (H), park (P) and wine (W) 
when subject to data-independent corruption 





     mean-imp      ind-imp       IRR           no corr
A    .180 ± .006   .183 ± .012   .167 ± .011   .159 ± .004
H    .400 ± .064   .363 ± .041   .326 ± .035   .289 ± .001
P    .444 ± .008   .423 ± .015   .377 ± .035   .422 ± .001
W    .264 ± .009   .260 ± .011   .256 ± .011   .247 ± .001
T    .531 ± .005   .528 ± .003   .521 ± .004   -





Table 4: RMSE for various imputation methods across the 
datasets abalone (A), housing (H), park (P) and wine (W) 
when subject to data-dependent corruption. The thyroid 
(T) dataset has naturally occurring missing features. 



better than the competitor methods. Zero-imputation
is not shown due to space, but it performs uniformly
worse. Figure [1] shows more detailed results for the
abalone dataset across different levels of corruption
and displays the consistent improvement which the
IRR algorithm provides.
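
The independent-imputation baseline described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the convention that the mask `Z` holds 1 for an observed feature and 0 for a missing one, and the choice to zero-fill missing entries before fitting are all assumptions made here.

```python
import numpy as np

def independent_imputation_matrix(X, Z):
    """Fit M column by column, independently of any downstream learner.

    X: (n, d) features with missing entries zero-filled (assumed convention).
    Z: (n, d) 0/1 mask, 1 = feature observed (assumed convention).
    Column i of M holds the least-squares predictor of feature i from the
    other features, fit on the rows where feature i is present.
    """
    n, d = X.shape
    M = np.zeros((d, d))
    for i in range(d):
        rows = Z[:, i] == 1
        if not rows.any():
            continue  # feature never observed: leave its column at zero
        others = np.arange(d) != i
        v, *_ = np.linalg.lstsq(X[rows][:, others], X[rows, i], rcond=None)
        M[others, i] = v
    return M

def impute(X, Z, M):
    """Fill each missing entry with the linear prediction (M^T x)_i."""
    return X + (1 - Z) * (X @ M)
```

A typical use would be `impute(X_test, Z_test, independent_imputation_matrix(X_train, Z_train))`, followed by ordinary ridge regression on the imputed data.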

In Table [5] we see that, with respect to the column-
corrupted optdigits dataset, the IRR algorithm performs
significantly better than zero-imputation and
mean-imputation in the majority of tasks.

6 Conclusion 

We have introduced two new algorithms, addressing 
the problem of learning with missing features in both 
the adversarial online and i.i.d. batch settings. The al- 
gorithms are motivated by intuitive constructions and 
we also provide theoretical performance guarantees. 
Empirically we show encouraging initial results for on- 
line matrix-based corruption-dependent hypotheses as 
well as many significant results for the suggested IRR 
algorithm, which indicate superior performance when 
compared to several baseline imputation methods. 
     zero-imp      mean-imp      IRR           no corr
2    .352 ± .003   .351 ± .004   .346 ± .002   .321 ± .003
3    .450 ± .005   .435 ± .004   .426 ± .005   .398 ± .004
4    [...]         .363 ± .002   .364 ± .003   .345 ± .002
6    .369 ± .003   .360 ± .002   .353 ± .003   .333 ± .003



Table 5: RMSE (using binary labels) for one-vs-all classifi- 
cation on optdigits subject to column-based corruption. 
A Proof of Proposition [1]



B Proof of Theorem [1]



The strategy used here is to consider the total regret
accumulated by an algorithm on several different tasks,
each one indexed by a different $z_t$.

First, in order to simplify the interaction between $z_t$
and $x_t$, suppose only the first $d/2$ coordinates of $x_t$
contain information (and the rest are always set equal
to 0) and assume only the last $d/2$ coordinates of $z_t$
contain any information (the rest are always set equal
to 1). Thus, with every one of the $2^{d/2}$ distinct values
of $z_t$ we associate a different independent $w^*(z_t)$, or
task, which the algorithm is trying to learn.

The main intuition is that the learning problem now
reduces to a multitask classification problem with
$K = 2^{d/2}$ different tasks. Without further assumptions, it
can be shown that the minimax regret of such a multitask
classification problem is as bad as solving the
tasks independently.

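We partition the total number of iterations as
$T = \sum_{i=1}^{2^{d/2}} T_i$, where each $T_i$ is the number of
iterations on which a particular value of $z_t$ was used by
the adversary. In order to analyze the minimax regret, we
can use von Neumann duality (see e.g. [1]) to get

$$\inf_{w_1} \sup_{(x_1,z_1,y_1)} \cdots \inf_{w_T} \sup_{(x_T,z_T,y_T)} \Big[ \sum_{t=1}^{T} \ell(\langle w_t(z_t), x'_t \rangle, y_t) - \inf_{w \in \mathcal{W}} \sum_{t=1}^{T} \ell(\langle w(z_t), x'_t \rangle, y_t) \Big]$$

$$= \sup_{p} \mathbb{E} \Big[ \sum_{t=1}^{T} \inf_{w_t} \mathbb{E}\big[ \ell(\langle w_t(z_t), x'_t \rangle, y_t) \,\big|\, (x_s, y_s, z_s)_{s=1}^{t-1} \big] - \inf_{w \in \mathcal{W}} \sum_{t=1}^{T} \ell(\langle w(z_t), x'_t \rangle, y_t) \Big],$$

where the supremum is over joint distributions $p$ on
sequences $(x_1, y_1, z_1), \ldots, (x_T, y_T, z_T)$.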

It is clear that the first term decomposes over the $2^{d/2}$
tasks (since it decomposes over individual examples).
The second minimization optimizes over all mappings
in the set $\mathcal{W}$. This can be done alternatively by
maximizing over the choice of a weight vector for each task
individually. As a result, the minimax regret decomposes
as a sum of the minimax regrets for each task.

If we choose $T_i = T/2^{d/2}$ then the total regret (which
is the sum of the regrets accumulated from each task)
is measured as follows,

$$\sum_{i=1}^{2^{d/2}} R^*(T_i) = \sum_{i=1}^{2^{d/2}} R^*\Big(\frac{T}{2^{d/2}}\Big) = 2^{d/2}\, R^*\Big(\frac{T}{2^{d/2}}\Big).$$

This completes the proof of the proposition.
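
To get a feel for the size of this lower bound, one can plug in a concrete rate. The following is only an illustration under an assumption that is not part of the argument above, namely that the single-task minimax regret scales as $R^*(T) = c\sqrt{T}$ for some constant $c > 0$:

```latex
2^{d/2}\, R^*\!\left(\frac{T}{2^{d/2}}\right)
  = 2^{d/2}\, c\, \sqrt{\frac{T}{2^{d/2}}}
  = c\, 2^{d/4} \sqrt{T}.
```

Under that assumed rate, the lower bound still grows as $\sqrt{T}$ in the horizon but picks up a factor that is exponential in the dimension $d$.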



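Proof of Theorem [1]. The proof is standard and is included
for completeness. We recall from the update rule that

$$A_{t+1} = \arg\min_{A \in \mathcal{A}} \big\{ \eta_t \langle \nabla \ell_t(A_t), A \rangle_F + D_{\mathcal{R}}(A, A_t) \big\}.$$

Consequently, $A_{t+1}$ satisfies the first order optimality
conditions:

$$\langle \eta_t \nabla \ell_t(A_t) + \nabla \mathcal{R}(A_{t+1}) - \nabla \mathcal{R}(A_t),\; A - A_{t+1} \rangle_F \ge 0, \qquad (13)$$

for all $A \in \mathcal{A}$. Now for any fixed $A \in \mathcal{A}$, we can
write the regret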

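$$\sum_{t=1}^{T} \ell_t(A_t) - \ell_t(A) \le \sum_{t=1}^{T} \langle \nabla \ell_t(A_t), A_t - A \rangle_F$$

$$= \sum_{t=1}^{T} \big[ \langle \nabla \ell_t(A_t), A_{t+1} - A \rangle_F + \langle \nabla \ell_t(A_t), A_t - A_{t+1} \rangle_F \big]$$

$$\le \sum_{t=1}^{T} \Big[ \frac{1}{\eta_t} \langle \nabla \mathcal{R}(A_{t+1}) - \nabla \mathcal{R}(A_t), A - A_{t+1} \rangle_F + \langle \nabla \ell_t(A_t), A_t - A_{t+1} \rangle_F \Big]. \qquad (14)$$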



Here the first inequality follows from the convexity of
the loss $\ell_t$ and the last inequality is a consequence
of (13). Also, applying (13) with $A = A_t$ gives

$$\eta_t \langle \nabla \ell_t(A_t), A_t - A_{t+1} \rangle_F \ge \langle \nabla \mathcal{R}(A_{t+1}) - \nabla \mathcal{R}(A_t), A_{t+1} - A_t \rangle_F = D_{\mathcal{R}}(A_t, A_{t+1}) + D_{\mathcal{R}}(A_{t+1}, A_t) \ge \|A_t - A_{t+1}\|^2,$$

where the last step is a consequence of the strong
convexity of the regularizer $\mathcal{R}$. Finally, applying
Hölder's inequality to the LHS of the above display yields

$$\|A_t - A_{t+1}\|^2 \le \eta_t \langle \nabla \ell_t(A_t), A_t - A_{t+1} \rangle_F \le \eta_t \|\nabla \ell_t(A_t)\|_* \|A_t - A_{t+1}\|,$$

where $\|\cdot\|_*$ is the dual norm to $\|\cdot\|$. Hence we get

$$\|A_t - A_{t+1}\| \le \eta_t G, \qquad (15)$$

where the last step follows from the Lipschitz assumption
in the theorem statement. As a result, we can
bound the second term in (14) as

$$\langle \nabla \ell_t(A_t), A_t - A_{t+1} \rangle_F \le \|\nabla \ell_t(A_t)\|_* \|A_t - A_{t+1}\| \le \eta_t G^2. \qquad (16)$$



For the first term in (14), we note that

$$\langle \nabla \mathcal{R}(A_{t+1}) - \nabla \mathcal{R}(A_t), A - A_{t+1} \rangle_F \le D_{\mathcal{R}}(A, A_t) - D_{\mathcal{R}}(A_{t+1}, A_t) - D_{\mathcal{R}}(A, A_{t+1}) \le D_{\mathcal{R}}(A, A_t) - D_{\mathcal{R}}(A, A_{t+1}),$$

where the last step follows from non-negativity of
Bregman divergences. Finally, we combine the two
bounds from above and substitute the value
$\eta_t = R/(G\sqrt{T})$. Simplifying yields the statement of
the theorem.
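
The update analyzed in this proof can be made concrete for one particular regularizer. The sketch below is an illustration under assumptions that go beyond the text: it takes the regularizer to be half the squared Frobenius norm (so the Bregman divergence is half the squared Frobenius distance and the update reduces to a projected gradient step), takes the feasible set to be a Frobenius ball of radius `radius`, and lets `radius` play the role of $R$ in the step size. The function names are hypothetical.

```python
import numpy as np

def project_frobenius_ball(A, radius):
    """Euclidean projection of a matrix onto {A : ||A||_F <= radius}."""
    norm = np.linalg.norm(A)
    return A if norm <= radius else A * (radius / norm)

def online_matrix_descent(grad_fns, shape, radius, G):
    """Mirror descent with the squared Frobenius regularizer (assumed here).

    grad_fns: list of callables, grad_fns[t](A) returns the gradient of the
              round-t loss at A; G is an assumed bound on gradient norms.
    Returns the list of iterates A_1, ..., A_T that were played.
    """
    T = len(grad_fns)
    eta = radius / (G * np.sqrt(T))  # fixed step size R / (G sqrt(T))
    A = np.zeros(shape)
    iterates = []
    for grad in grad_fns:
        iterates.append(A)
        A = project_frobenius_ball(A - eta * grad(A), radius)
    return iterates
```

With this choice, consecutive iterates move by at most `eta * G`, which is the content of (15), and for linear losses the regret against the best fixed matrix in the ball is at most `radius * G * sqrt(T)`.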

C Proof of Proposition [2] 

In order to formulate a tractable problem we first 
rewrite the imputed ridge regression problem in its 
dual formulation. 



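$$\min_{M} \max_{\alpha} \; 2 \sum_{i=1}^{T} \alpha_i y_i - \sum_{i,j=1}^{T} \alpha_i \alpha_j \big( (x_i + Z_i M^\top x_i)^\top (x_j + Z_j M^\top x_j) + \lambda T [I]_{i,j} \big) \quad \text{s.t. } \|M\|_F^2 \le \gamma^2$$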



The inner maximization problem is concave in $\alpha$ and
the optimal solution for any fixed M is found via the
standard closed form solution for ridge regression:

$$\alpha^* = \big( \underbrace{(X + Z \circ MX)(X + Z \circ MX)^\top}_{K_M} + \lambda T I \big)^{-1} y,$$

where $\circ$ denotes the component-wise (Hadamard)
product between matrices and $K_M$ will be used to
denote the Gram matrix containing dot-products between
imputed training instances. Plugging this solution
into the minimax problem results in the following
matrix fractional minimization problem,

$$\min_{M} \; y^\top (K_M + \lambda T I)^{-1} y, \quad \text{s.t. } \|M\|_F^2 \le \gamma^2.$$
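
For a fixed M, the closed-form dual solution and the matrix fractional objective above are straightforward to compute. The following NumPy sketch is illustrative only: the function names are hypothetical, rows of `X` are assumed to be training instances, and row i of `Z` is assumed to hold the diagonal of $Z_i$, so that row i of the imputed design matrix is $(x_i + Z_i M^\top x_i)^\top$.

```python
import numpy as np

def imputed_design(X, Z, M):
    """Stack the imputed instances x_i + Z_i M^T x_i as rows.

    X: (T, d) instances; Z: (T, d) with row i the diagonal of Z_i (assumed);
    M: (d, d) imputation matrix.
    """
    return X + Z * (X @ M)

def dual_ridge(X, Z, M, y, lam):
    """Return alpha* = (K_M + lam*T*I)^{-1} y and the value y^T alpha*."""
    T = X.shape[0]
    Xi = imputed_design(X, Z, M)
    K = Xi @ Xi.T                      # Gram matrix of imputed instances
    alpha = np.linalg.solve(K + lam * T * np.eye(T), y)
    return alpha, float(y @ alpha)
```

The primal weight vector can be recovered as `imputed_design(X, Z, M).T @ alpha`, which coincides with the usual primal ridge solution on the imputed data.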

This problem is still not convex in M due to the
quadratic terms that appear in $K_M$. The main idea
for the convex relaxation will be to introduce new
variables $[N_k]_{i,j}$ which substitute the quadratic terms
$[M]_{i,k}[M]_{j,k}$, resulting in a matrix $K_{MN}$ that is
linear in terms of the optimization variables M and $N_k$.
This is shown precisely below:

$$[K_M]_{i,j} = x_i^\top x_j + x_i^\top M Z_i x_j + x_i^\top Z_j M^\top x_j + x_i^\top M Z_i Z_j M^\top x_j$$

$$= x_i^\top x_j + x_i^\top M Z_i x_j + x_i^\top Z_j M^\top x_j + \sum_{r,s,k=1}^{d} [x_i]_r [x_j]_s [z_i]_k [z_j]_k [M]_{r,k} [M]_{s,k}$$

$$[K_{MN}]_{i,j} = x_i^\top x_j + x_i^\top M Z_i x_j + x_i^\top Z_j M^\top x_j + \sum_{k=1}^{d} [z_i]_k [z_j]_k \, x_i^\top N_k x_j$$
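
The substitution above can be checked numerically: when each $N_k$ is set to the outer product of the kth column of M with itself, the relaxed matrix $K_{MN}$ should coincide with the Gram matrix $K_M$. The sketch below is a hedged illustration; the function name and the array layout (`N[k]` is the matrix $N_k$, row i of `Z` is the diagonal of $Z_i$) are assumptions made for this example.

```python
import numpy as np

def gram_relaxed(X, Z, M, N):
    """Compute K_MN, which is linear in M and in the stack N of shape (d, d, d).

    Entry (i, j) is
      x_i^T x_j + x_i^T M Z_i x_j + x_i^T Z_j M^T x_j
        + sum_k [z_i]_k [z_j]_k x_i^T N_k x_j.
    """
    XM = (X @ M) * Z                   # row i: M^T x_i masked by z_i
    linear = X @ X.T + XM @ X.T + X @ XM.T
    quad = np.einsum('ik,jk,ir,krs,js->ij', Z, Z, X, N, X)
    return linear + quad
```

For example, with `N = np.stack([np.outer(M[:, k], M[:, k]) for k in range(d)])`, the output agrees with the Gram matrix of the imputed instances; for a general `N` it need not be a Gram matrix at all.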



Note that the matrix $K_{MN}$ no longer necessarily
corresponds to a Gram matrix and that $(K_{MN} + \lambda T I)$ may
no longer be positive semi-definite (which is required
for the convexity of a matrix fractional problem
objective). Thus, we add an additional explicit positive
semi-definiteness constraint resulting in the following
optimization problem,

$$\min_{M, N, t} \; t \quad \text{s.t.} \quad t - y^\top (K_{MN} + \lambda T I)^{-1} y \ge 0, \quad K_{MN} \succeq 0, \quad \|M\|_F^2 \le \gamma^2, \quad \sum_{k=1}^{d} \|N_k\|_F^2 \le \gamma^4,$$



where we've additionally added the dummy variable
t and also constrained the norm of the new $[N_k]_{i,j}$
variables. The choice of the upper bound is made
with the knowledge that $[N_k]_{i,j}$ replaces the
variables $[M]_{i,k}[M]_{j,k}$ and that the bound
$\|M\|_F^2 \le \gamma^2$ implies

$$\sum_{i,j,k=1}^{d} [M]_{i,k}^2 [M]_{j,k}^2 = \sum_{k=1}^{d} \Big( \sum_{i=1}^{d} [M]_{i,k}^2 \Big)^2 \le \Big( \sum_{i,k=1}^{d} [M]_{i,k}^2 \Big)^2 \le \gamma^4.$$

The constraint involving the dummy variable t is a 
Schur complement and can be replaced with an equiv- 
alent positive semi-definiteness constraint, which re- 
sults in a standard form semidefinite program and 
completes the proof. 
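
For reference, the Schur complement step can be written out explicitly. This is the standard Schur complement lemma applied to the constraint above, stated here as a sketch rather than as the exact form used by the authors; it relies on $K_{MN} + \lambda T I$ being positive definite, which holds when $K_{MN} \succeq 0$ and $\lambda > 0$:

```latex
t - y^\top (K_{MN} + \lambda T I)^{-1} y \ge 0
\quad\Longleftrightarrow\quad
\begin{bmatrix} K_{MN} + \lambda T I & y \\ y^\top & t \end{bmatrix} \succeq 0 .
```

Since $K_{MN}$ is linear in M and the $N_k$, the block matrix on the right is linear in all optimization variables, which is what makes the problem a semidefinite program.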

D Proof of Theorem H 

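It suffices to bound each of the following terms
individually:

(a) $\mathbb{E}_\sigma \big[ \sup_{\alpha, M, N} \big| \sum_{i,j=1}^{T} \sigma_i \alpha_j \, x_i'^\top x_j \big| \big]$

(b) $\mathbb{E}_\sigma \big[ \sup_{\alpha, M, N} \big| \sum_{i,j=1}^{T} \sigma_i \alpha_j \, x_i'^\top Z_j M^\top x_j \big| \big]$

(c) $\mathbb{E}_\sigma \big[ \sup_{\alpha, M, N} \big| \sum_{i,j=1}^{T} \sigma_i \alpha_j \, x_i'^\top M Z_i' x_j \big| \big]$

(d) $\mathbb{E}_\sigma \big[ \sup_{\alpha, M, N} \big| \sum_{i,j=1}^{T} \sum_{k=1}^{d} \sigma_i \alpha_j [z_i']_k [z_j]_k \, x_i'^\top N_k x_j \big| \big]$

then combining the bounds and dividing by T proves
the theorem.

To bound (a) we first apply the Cauchy-Schwarz
inequality to separate the $\sigma_i$ and $\alpha_j$ terms:

$$\mathbb{E}_\sigma \Big[ \sup_{\alpha} \Big| \sum_{i,j=1}^{T} \sigma_i \alpha_j \, x_i'^\top x_j \Big| \Big] \le \sup_{\alpha} \Big\| \sum_{j=1}^{T} \alpha_j x_j \Big\| \; \mathbb{E}_\sigma \Big[ \Big\| \sum_{i=1}^{T} \sigma_i x_i' \Big\| \Big].$$

The $\alpha_j$ term is bounded as follows,

$$\sup_{\alpha} \Big\| \sum_{j=1}^{T} \alpha_j x_j \Big\| = \sup_{\alpha} \sqrt{\alpha^\top X X^\top \alpha} \le \sup_{\alpha} \|\alpha\| \, \|X X^\top\|_2^{1/2} \le \frac{B}{\lambda \sqrt{T}} \sqrt{\mathrm{Tr}[X X^\top]} \le \frac{BR}{\lambda}.$$

The $\sigma_i$ term is bounded using the fact that for
independent Rademacher variables $\sigma_i$ and $\sigma_j$ the
expectation $\mathbb{E}[\sigma_i \sigma_j] = 0$ for $i \neq j$.

$$\mathbb{E}_\sigma \Big[ \Big\| \sum_{i=1}^{T} \sigma_i x_i' \Big\| \Big] = \mathbb{E}_\sigma \Big[ \sqrt{\sigma^\top X' X'^\top \sigma} \Big] \le \sqrt{\mathrm{Tr}[X' X'^\top]} \le R \sqrt{T}.$$

Thus, the first term is bounded as (a) $\le BR^2\sqrt{T}/\lambda$. To
bound the second term, (b), we again apply Cauchy-Schwarz
to separate the $\sigma$ and $(\alpha, M)$ terms. The
$\sigma$ portion is again bounded by $R\sqrt{T}$ and the
remainder of the bound follows similar steps as in the
bound of (a), decomposing the norm into a product
between $\|\alpha\|$ and a trace term. If we define
$[B]_{i,j} = x_i^\top M Z_i Z_j M^\top x_j$, then

$$\sup_{\alpha, M} \Big\| \sum_{i=1}^{T} \alpha_i Z_i M^\top x_i \Big\| = \sup_{\alpha, M} (\alpha^\top B \alpha)^{1/2} \le \sup_{\alpha, M} \|\alpha\| \, \lambda_{\max}(B)^{1/2} \le \sup_{\alpha, M} \|\alpha\| \, \mathrm{Tr}(B)^{1/2} \le \sup_{\alpha, M} \|\alpha\| \Big( \sum_{i=1}^{T} x_i^\top M Z_i Z_i M^\top x_i \Big)^{1/2}, \qquad (17)$$

where the second inequality follows from the fact that
B is positive semi-definite. Note, since $Z_i$ is a diagonal
0/1 matrix we have $Z_i Z_i = Z_i$. To bound the term
depending on M we use the following set of inequalities

$$\sup_{M} \big( x_i^\top M Z_i M^\top x_i \big)^{1/2} \le \sup_{M} \big( \|x_i\|^2 \, \mathrm{Tr}[M Z_i M^\top] \big)^{1/2} \le \sup_{M} R \, \langle M, M Z_i \rangle_F^{1/2} \le \sup_{M} R \big( \|M\|_F \|M Z_i\|_F \big)^{1/2} \le \sup_{M} R \|M\|_F \le \gamma R.$$

Thus, the final bound on the second term is
(b) $\le \gamma B R^2 \sqrt{T}/\lambda$. In order to separate the $\sigma$ terms
from the $(\alpha, M)$ terms in the third term, (c), the
expression is first expanded, using $[M]_{:,s}$ to denote the
sth column of the matrix M, and then Cauchy-Schwarz is
applied,

$$\sum_{i,j=1}^{T} \sigma_i \alpha_j \, x_i'^\top M Z_i' x_j = \sum_{i,j=1}^{T} \sigma_i \alpha_j \sum_{s=1}^{d} \big( x_i'^\top [M]_{:,s} \big) [z_i']_s [x_j]_s = \sum_{s=1}^{d} \Big( \sum_{i=1}^{T} \sigma_i [z_i']_s x_i' \Big)^\top \Big( \sum_{j=1}^{T} \alpha_j [x_j]_s [M]_{:,s} \Big)$$

$$\le \underbrace{\Big( \sum_{s=1}^{d} \Big\| \sum_{i=1}^{T} \sigma_i [z_i']_s x_i' \Big\|^2 \Big)^{1/2}}_{(i)} \; \underbrace{\Big( \sum_{s=1}^{d} \Big\| \sum_{j=1}^{T} \alpha_j [x_j]_s [M]_{:,s} \Big\|^2 \Big)^{1/2}}_{(ii)}.$$

The inequality follows from the fact that, given vectors
$v_s, u_s$, and writing $v$ and $u$ for their concatenations
over $s$, we have:

$$\sum_{s} v_s^\top u_s = v^\top u \le \|v\| \, \|u\| = \Big( \sum_{s} \|v_s\|^2 \Big)^{1/2} \Big( \sum_{s} \|u_s\|^2 \Big)^{1/2},$$

where the inequality follows from Cauchy-Schwarz. To
bound the term (i) that depends on $\sigma$ we note

$$\mathbb{E}_\sigma \Big[ \Big( \sum_{s=1}^{d} \Big\| \sum_{i=1}^{T} \sigma_i [z_i']_s x_i' \Big\|^2 \Big)^{1/2} \Big] \le \Big( \sum_{s=1}^{d} \mathbb{E}_\sigma \Big[ \Big\| \sum_{i=1}^{T} \sigma_i [z_i']_s x_i' \Big\|^2 \Big] \Big)^{1/2}$$

and for any s bound the expectation as follows,

$$\mathbb{E}_\sigma \Big[ \Big\| \sum_{i=1}^{T} \sigma_i [z_i']_s x_i' \Big\|^2 \Big] = \sum_{i=1}^{T} [z_i']_s^2 \|x_i'\|^2 \le T R^2.$$

Thus, the term (i) is bounded by $R\sqrt{dT}$. To bound
(ii) we again first rewrite the expression in terms of
$\|\alpha\|$ and a matrix trace (using a similar argument as
in (17), but with $[B]_{i,j} = \sum_{s=1}^{d} [x_i]_s [x_j]_s \sum_{r=1}^{d} [M]_{r,s}^2$),
then apply the Cauchy-Schwarz inequality,

$$\sup_{\alpha, M} \Big( \sum_{s=1}^{d} \Big\| \sum_{j=1}^{T} \alpha_j [x_j]_s [M]_{:,s} \Big\|^2 \Big)^{1/2} = \sup_{\alpha, M} \Big( \sum_{i,j=1}^{T} \alpha_i \alpha_j \sum_{s=1}^{d} [x_i]_s [x_j]_s \sum_{r=1}^{d} [M]_{r,s}^2 \Big)^{1/2}$$

$$\le \sup_{\alpha, M} \|\alpha\| \Big( \sum_{i=1}^{T} \sum_{s=1}^{d} [x_i]_s^2 \sum_{r=1}^{d} [M]_{r,s}^2 \Big)^{1/2}$$

$$\le \frac{B}{\lambda \sqrt{T}} \sup_{M} \Big( \sum_{i=1}^{T} \Big( \sum_{s=1}^{d} [x_i]_s^4 \Big)^{1/2} \Big( \sum_{s=1}^{d} \Big( \sum_{r=1}^{d} [M]_{r,s}^2 \Big)^2 \Big)^{1/2} \Big)^{1/2}$$

$$\le \frac{B}{\lambda \sqrt{T}} \sup_{M} \Big( \sum_{i=1}^{T} \Big( \sum_{s=1}^{d} [x_i]_s^2 \times \sum_{r,s=1}^{d} [M]_{r,s}^2 \Big) \Big)^{1/2} = \frac{B}{\lambda \sqrt{T}} \sup_{M} \Big( \sum_{i=1}^{T} \|x_i\|^2 \|M\|_F^2 \Big)^{1/2} \le \frac{\gamma B R}{\lambda}.$$

Combining these two parts gives a bound of
(c) $\le \gamma B R^2 \sqrt{dT}/\lambda$. Let V denote the matrix with kth
column equal to $N_k \big( \sum_{j=1}^{T} \alpha_j [z_j]_k x_j \big)$, then the bound
on (d) follows a similar pattern, first separating $\sigma$
and $(N, \alpha)$ using the Cauchy-Schwarz inequality:

$$\mathbb{E}_\sigma \Big[ \sup_{\alpha, N} \Big| \sum_{i,j=1}^{T} \sigma_i \alpha_j \sum_{k=1}^{d} [z_i']_k [z_j]_k \, x_i'^\top N_k x_j \Big| \Big] = \mathbb{E}_\sigma \Big[ \sup_{\alpha, N} \Big| \sum_{i=1}^{T} \sigma_i x_i'^\top V z_i' \Big| \Big] = \mathbb{E}_\sigma \Big[ \sup_{\alpha, N} \Big| \Big\langle \sum_{i=1}^{T} \sigma_i x_i' z_i'^\top, V \Big\rangle_F \Big| \Big]$$

$$\le \mathbb{E}_\sigma \Big[ \Big\| \sum_{i=1}^{T} \sigma_i x_i' z_i'^\top \Big\|_F \Big] \sup_{\alpha, N} \|V\|_F.$$

The $\sigma$ term is bounded as follows,

$$\mathbb{E}_\sigma \Big[ \Big\| \sum_{i=1}^{T} \sigma_i x_i' z_i'^\top \Big\|_F \Big] \le \Big( \sum_{i,j=1}^{T} \mathbb{E}[\sigma_i \sigma_j] \, \mathrm{Tr}\big[ x_i' z_i'^\top z_j' x_j'^\top \big] \Big)^{1/2} = \Big( \sum_{i=1}^{T} \|x_i'\|^2 \|z_i'\|^2 \Big)^{1/2} \le R \sqrt{dT}.$$

The supremum term is bounded using the following
set of inequalities,

$$\sup_{\alpha, N} \|V\|_F = \sup_{\alpha, N} \Big( \sum_{k=1}^{d} \Big\| \sum_{i=1}^{T} \alpha_i [z_i]_k N_k x_i \Big\|^2 \Big)^{1/2} \le \sup_{\alpha, N} \|\alpha\| \Big( \sum_{k=1}^{d} \sum_{i=1}^{T} [z_i]_k \, x_i^\top N_k^\top N_k x_i \Big)^{1/2}$$

$$\le \frac{B}{\lambda \sqrt{T}} \sup_{N} \Big( \sum_{i=1}^{T} \|x_i\|^2 \sum_{k=1}^{d} \|N_k\|_F^2 \Big)^{1/2} \le \frac{\gamma^2 B R}{\lambda}.$$

Taking the sum over k results in a bound of
(d) $\le \gamma^2 B R^2 \sqrt{dT}/\lambda$, which completes the bound.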



E Rademacher Analysis for 
Non-Relaxed Class 

In this section we analyze the generalization per- 
formance of the original (non-relaxed) class of 
imputation-based hypotheses: 

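$$\mathcal{G} = \big\{ h(x, z) \mapsto w^\top (x + Z M^\top x) : \|w\| \le \Lambda, \; \|M\|_F \le \gamma \big\} \qquad (18)$$

Theorem 4 If we assume a bounded regression problem,
$\forall y, |y| \le B$ and $\forall x, \|x\| \le R$, then the
Rademacher complexity of the hypothesis set $\mathcal{G}$ is
bounded as follows,

$$\mathfrak{R}_T(\mathcal{G}) \le \frac{(1 + \gamma \sqrt{d}) \Lambda R}{\sqrt{T}}.$$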



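Proof: We wish to bound

$$\mathbb{E}_\sigma \Big[ \sup_{w, M} \Big| \sum_{i=1}^{T} \sigma_i w^\top (x_i + Z_i M^\top x_i) \Big| \Big] \le \mathbb{E}_\sigma \Big[ \sup_{w} \Big| \sum_{i=1}^{T} \sigma_i w^\top x_i \Big| \Big] + \mathbb{E}_\sigma \Big[ \sup_{w, M} \Big| \sum_{i=1}^{T} \sigma_i w^\top Z_i M^\top x_i \Big| \Big].$$

The first term is standard and is bounded by $\Lambda R \sqrt{T}$.
We bound the second term by first separating the
terms depending on w, M and those depending on $\sigma$:

$$\sum_{i=1}^{T} \sigma_i w^\top Z_i M^\top x_i = \sum_{s,r=1}^{d} [w]_r [M]_{s,r} \Big( \sum_{i=1}^{T} \sigma_i [z_i]_r [x_i]_s \Big) \le \underbrace{\Big( \sum_{s,r=1}^{d} [w]_r^2 [M]_{s,r}^2 \Big)^{1/2}}_{(a)} \; \underbrace{\Big( \sum_{s,r=1}^{d} \Big( \sum_{i=1}^{T} \sigma_i [z_i]_r [x_i]_s \Big)^2 \Big)^{1/2}}_{(b)},$$

where the inequality follows from the Cauchy-Schwarz
inequality. We bound the first term using the fact
that every term in the sum is positive and adding an
additional summation:

$$(a) \le \Big( \Big( \sum_{r=1}^{d} [w]_r^2 \Big) \Big( \sum_{r,s=1}^{d} [M]_{s,r}^2 \Big) \Big)^{1/2} = \|w\| \, \|M\|_F \le \Lambda \gamma.$$

The second term is bounded by first applying
Jensen's inequality and then the property that
$\mathbb{E}_\sigma[(\sum_i \sigma_i a_i)^2] = \sum_i a_i^2$ for any constants $a_i$.

$$\mathbb{E}_\sigma[(b)] \le \Big( \sum_{s,r=1}^{d} \mathbb{E}_\sigma \Big[ \Big( \sum_{i=1}^{T} \sigma_i [z_i]_r [x_i]_s \Big)^2 \Big] \Big)^{1/2} = \Big( \sum_{i=1}^{T} \sum_{s,r=1}^{d} [z_i]_r^2 [x_i]_s^2 \Big)^{1/2} \le \Big( \sum_{i=1}^{T} \|z_i\|^2 \|x_i\|^2 \Big)^{1/2} \le R \sqrt{dT}.$$

Combining the two bounds and dividing by T proves
the theorem. ■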
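
The two facts about Rademacher variables used in this proof (the second-moment identity and the Jensen step) can be verified exactly for small T by enumerating all sign vectors. The sketch below is purely illustrative and its function names are hypothetical; it is exponential in T and only meant as a sanity check.

```python
import itertools
import numpy as np

def rademacher_second_moment(a):
    """Exact E_sigma[(sum_i sigma_i a_i)^2] by enumerating all 2^T sign vectors."""
    T = len(a)
    total = 0.0
    for signs in itertools.product([-1.0, 1.0], repeat=T):
        total += float(np.dot(signs, a)) ** 2
    return total / 2 ** T

def expected_norm(X):
    """Exact E_sigma || sum_i sigma_i x_i || for the rows x_i of X."""
    T = X.shape[0]
    total = 0.0
    for signs in itertools.product([-1.0, 1.0], repeat=T):
        total += float(np.linalg.norm(np.asarray(signs) @ X))
    return total / 2 ** T
```

The first function should return the sum of squares of `a`, and the second should never exceed the square root of the total squared norm of the rows, which in turn is at most $R\sqrt{T}$ when every row has norm at most $R$.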



F Experimental Setup 

We start by explaining the data-independent and
data-dependent corruption processes. In the
former case a probability of corruption is chosen
uniformly at random from $[0, \beta]$ for each feature (where
$\beta$ can be tuned to induce more or less corruption),
independent of the other features and independent of
the data instance (i.e. $z_t$ is independent of $x_t$). In the
latter case a random threshold $\tau_k$ is chosen for each
feature uniformly between [0,1] as well as a sign $\sigma_k$
chosen uniformly from $\{-1, 1\}$. Then, if a feature
satisfies $\sigma_k([x_i]_k - \tau_k) > 0$, it is deleted with probability
$\beta$, which again can be tuned to induce more or less
missing data. Table [1] shows the average fraction of
features remaining after being subject to each type of
corruption; that is, the total sum of features available
over all instances divided by the total number of
features that would be available in the corruption-free
case.
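
The two corruption processes described above can be sketched as mask generators. This is a hedged illustration rather than the experimental code: the function names are hypothetical, the returned mask uses the assumed convention 1 = feature kept and 0 = feature deleted, and the data-dependent variant assumes features have been scaled to [0,1] so that thresholds drawn from [0,1] are meaningful.

```python
import numpy as np

def independent_corruption(X, beta, rng):
    """Data-independent mask: feature k is deleted with probability p_k ~ U[0, beta]."""
    n, d = X.shape
    p = beta * rng.uniform(size=d)               # per-feature deletion probability
    return (rng.uniform(size=(n, d)) >= p).astype(float)

def dependent_corruption(X, beta, rng):
    """Data-dependent mask: entries with sigma_k * (x_ik - tau_k) > 0 are deleted w.p. beta."""
    n, d = X.shape
    tau = rng.uniform(0.0, 1.0, size=d)          # per-feature threshold
    sign = rng.choice([-1.0, 1.0], size=d)       # per-feature sign
    eligible = sign * (X - tau) > 0
    delete = eligible & (rng.uniform(size=(n, d)) < beta)
    return (~delete).astype(float)

def fraction_remaining(Z):
    """Features available over all instances divided by the corruption-free total."""
    return float(Z.mean())
```

Calling `fraction_remaining` on the resulting masks gives the quantity that the text reports as the average fraction of features remaining.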

The average error-rate along with one standard
deviation is reported over 5 trials, each with a random
fold of 1000 training points (with the exception of
optdigits in the online setting, where 3000 points are
used). In the batch setting the remainder of the dataset
in each trial is used as the test set. When applicable,
each trial is also subjected to a different random
corruption pattern. All scores are reported with respect
to the best performing parameters, $\lambda$, $C$, $\eta$ and $\gamma$,
tuned across the values $\{2^{-12}, 2^{-11}, \ldots, 2^{10}\}$.



